1 Introduction

In recent years, there has been a variety of research
on discourse parsing, particularly RST discourse 
parsing (Feng and Hirst, 2014; Li et al., 2014b; Ji 
and Eisenstein, 2014; Joty and Moschitti, 2014; Li et 
al., 2014a). Most of the recent work on RST parsing 
has focused on implementing new types of features 
or learning algorithms in order to improve accuracy, 
with relatively little focus on efficiency, robustness, 
or practical use. Also, most implementations are not 
widely available. 

Here, we describe an RST segmentation and parsing
system that adapts models and feature sets from
various previous work, as described below. Its
accuracy is near state-of-the-art, and it was developed
to be fast, robust, and practical. For example, it can
process short documents such as news articles or
essays in less than a second.

The system is written in Python and is publicly
available at
https://github.com/EducationalTestingService/discourse-parsing.

2 Tasks and Data 

We address two tasks in this work: discourse
segmentation and discourse parsing. Discourse
segmentation is the task of taking a sequence of word
and punctuation tokens as input and identifying
boundaries where new discourse units begin.
Discourse parsing is the task of taking a sequence of
discourse units and identifying relationships between
them. In our case, the set of these relationships forms
a tree.


For both, we follow the conventions encoded in 
the RST Discourse Treebank (Carlson et al., 2002). 
Here, we give a brief overview of the corpus. See 
Carlson et al. (2001) for more information. 

The treebank represents discourse as a tree, with
labels on nodes indicating relationships between
siblings. Most RST relationships have a nucleus,
expressing the core content, and a satellite that
contributes additional information to the nucleus.
Probably the simplest example is the “attribution”
relationship: attributed (e.g., quoted) text is labeled
as the nucleus, and text indicating the source of the
attributed text is labeled as the satellite, with an
“attribution” subcategorization.

The leaves of the RST trees are “elementary
discourse units” (EDUs), which are contiguous spans
of tokens roughly similar to independent clauses.
Most branching in RST trees is binary, with one
satellite and one nucleus, though there are some
relations that have multiple nuclei and no satellite
(e.g., lists).
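
To make the tree representation concrete, the following is a minimal sketch of how such a tree could be held in memory. The class name `RSTNode`, its fields, the example sentence, and the use of "span" as the label on the nucleus of a mononuclear relation are assumptions made for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSTNode:
    """One node of an RST tree (hypothetical structure, for illustration)."""
    nuclearity: str                  # "nucleus", "satellite", or "root"
    relation: str                    # e.g. "attribution", "span", "list"
    children: List["RSTNode"] = field(default_factory=list)
    edu_text: Optional[str] = None   # set only on leaves (EDUs)

    def edus(self) -> List[str]:
        """Return the EDU texts under this node, left to right."""
        if not self.children:
            return [self.edu_text]
        return [text for child in self.children for text in child.edus()]


# Attribution: the attributed text is the nucleus, its source the satellite.
tree = RSTNode("root", "root", [
    RSTNode("satellite", "attribution", edu_text="The company said"),
    RSTNode("nucleus", "span", edu_text="it expects profits to rise."),
])
print(tree.edus())
```

A multinuclear relation such as a list would simply have several children whose `nuclearity` is "nucleus" and no satellite.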

The RST corpus consists of a training set of 347
documents and a test set of 38 documents. The texts
in the RST treebank are a subset of those in the Penn
Treebank (Marcus et al., 1993). For this reason,
we retrained the syntactic parser used in our
system, ZPar (Zhang and Clark, 2011), on the subset
of the Penn Treebank WSJ sections 2 to 21 texts not
present in the RST treebank.

For development of the system, we split the training
set into a smaller subset for model estimation
and a development validation set similar in size
(n = 40) to the RST treebank test set.


3 Discourse Segmenter Description 


In this section, we describe and evaluate the
discourse segmentation component of the system. Our
discourse segmenter is essentially a reimplementation
of the baseline system from Xuan Bach et al.
(2012). We do not implement their reranked model,
which is more complex to implement and probably
less efficient, and we use the ZPar parser (Zhang and
Clark, 2011) for automatic syntactic parsing.

3.1 Segmenter Model and Features 

Following Xuan Bach et al. (2012), we model RST
segmentation as a tagging problem. Specifically, for
each token in a sentence, the system predicts whether
that token is the beginning of a new EDU or the
continuation of an EDU. For this task, we use a
conditional random field (Lafferty et al., 2001) model with
ℓ2 regularization, using the CRF++ implementation
(https://crfpp.googlecode.com). Also,
we assume that a new sentence always starts a new
EDU, regardless of the CRF output.
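
The tagging formulation and the sentence-initial rule can be sketched as a small post-processing step that turns per-token tags into EDUs. This is an illustrative sketch only: the function name `tags_to_edus` and the continuation tag name "C-EDU" are hypothetical, and the tags would come from the CRF rather than being written by hand.

```python
from typing import List

B, C = "B-EDU", "C-EDU"  # "C-EDU" is a hypothetical name for the continuation tag


def tags_to_edus(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Group one sentence's tokens into EDUs from per-token tags."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    edus: List[List[str]] = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        # The first token of a sentence always starts an EDU,
        # regardless of what the tagger predicted.
        if i == 0 or tag == B:
            edus.append([token])
        else:
            edus[-1].append(token)
    return edus


tokens = ["He", "said", "the", "plan", "failed", "."]
predicted = [C, C, B, C, C, C]  # the tagger missed the sentence-initial boundary
print(tags_to_edus(tokens, predicted))
```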

The CRF uses simple word and POS features as
well as syntactic features. The word and POS
features are as follows (note that by “word”, we mean
word or punctuation token):

• the lowercased form of the current word 

• the part-of-speech (POS) of the current word 

The syntactic features are based on automatic
parses from ZPar (using a retrained model as
discussed in §2). For each of the following nodes in
the syntactic tree, there are two features, one for the
nonterminal symbol and the head word (e.g., “VP,
said”), and one for the nonterminal symbol and the
head POS (e.g., “VP, VBD”). Note that these
features will not be used for the last token in a sentence
since there is no subsequent token.

• Np: the first common ancestor of the current
token and the subsequent word

• the subtree of Np that contains the current word 

• the subtree of Np that contains the subsequent 
word 

• the parent of Np 

• the right sibling of Np 



                              P      R      F1
CRFSeg                        91.0   87.2   89.0
Bach-etal-2012 (Base)         91.4   90.1   90.7
Bach-etal-2012 (Reranking)    91.5   90.4   91.0
our system                    90.2   83.5   86.7
Human agreement               98.5   98.2   98.3

Table 1: Discourse segmentation performance in terms
of percentage precision (“P”), recall (“R”), and F1 score
(“F1”) for the “B-EDU” tag.


All of these features are extracted for the current 
word, the previous 2 words, and next 2 words in the 
sentence. 
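
As a rough illustration of the windowing, the sketch below collects the word and POS features for a token and its neighbours two positions to either side. It is not the system's feature code: CRF++ reads features through its own template mechanism, the syntactic features are omitted, and the function name and feature-key format are hypothetical.

```python
from typing import Dict, List, Tuple


def window_features(sentence: List[Tuple[str, str]], i: int,
                    width: int = 2) -> Dict[str, str]:
    """Word and POS features for token i and its neighbours within `width`."""
    feats: Dict[str, str] = {}
    for offset in range(-width, width + 1):
        j = i + offset
        if 0 <= j < len(sentence):
            word, pos = sentence[j]
            feats[f"word[{offset}]"] = word.lower()
            feats[f"pos[{offset}]"] = pos
        # Positions outside the sentence contribute no feature in this sketch.
    return feats


sentence = [("He", "PRP"), ("said", "VBD"), ("the", "DT"), ("plan", "NN")]
feats = window_features(sentence, 1)
print(feats)
```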

3.2 Segmenter Evaluation 

Following Xuan Bach et al. (2012), we evaluate
segmentation performance using the gold standard
EDUs from the RST treebank test set, using the F1
score for the tag indicating the beginning of a new
EDU (“B-EDU”). Since new sentences always
begin new EDUs, we exclude the first tag in the output
(always “B-EDU”) for each sentence. We first tuned
the CRF regularization parameter using grid search
on the split of the training set used for development
evaluations, using a grid of powers of 2 ranging from
1/64 to 64.
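
The evaluation protocol above can be sketched as follows. The function name `b_edu_prf` and the tag strings in the example are assumptions for illustration; the sketch only shows how skipping each sentence's first tag enters the precision, recall, and F1 computation, and how the grid of candidate regularization values could be enumerated.

```python
from typing import List, Tuple


def b_edu_prf(gold: List[List[str]], pred: List[List[str]],
              tag: str = "B-EDU") -> Tuple[float, float, float]:
    """Precision, recall and F1 for `tag`, skipping each sentence's first token."""
    tp = fp = fn = 0
    for gold_sent, pred_sent in zip(gold, pred):
        # Drop position 0: it is always "B-EDU" and would inflate the scores.
        for g, p in zip(gold_sent[1:], pred_sent[1:]):
            if p == tag and g == tag:
                tp += 1
            elif p == tag:
                fp += 1
            elif g == tag:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Candidate regularization values: powers of 2 from 1/64 to 64 (13 settings).
grid = [2.0 ** k for k in range(-6, 7)]

gold = [["B-EDU", "C-EDU", "B-EDU", "C-EDU", "B-EDU"]]
pred = [["B-EDU", "C-EDU", "B-EDU", "B-EDU", "C-EDU"]]
print(b_edu_prf(gold, pred))
```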

The results are shown in Table 1. For comparison,
we include previous results, including
human-human agreement, reported by Xuan Bach et al.
(2012), using syntax from the Stanford Parser (Klein
and Manning, 2003a) (it is not clear from the paper
what parsing model was used). The “CRFSeg”
results are for the system from Hernault et al. (2010).

We are uncertain as to the cause for the observed
differences in performance, though we hypothesize
that the differences are at least partially due to
differences in syntactic parsing, which is a key step in
feature computation.

4 Discourse Parser Description 

In this section, we describe our RST parser. It
borrows extensively from previous work, especially
Sagae (2009).^1

^1 Note that we do not include Sagae (2009) in our evaluations
since only within-sentence parsing performance was reported in
that paper.







4.1 Shift-Reduce Approach 

Following Sagae (2009) and Ji and Eisenstein
(2014), we use an “arc standard” shift-reduce
approach to RST discourse parsing.

4.2 Parsing Model 

The parser maintains two primary data structures: 
a queue containing the EDUs in the document that 
have not been processed yet, and a stack of RST 
subtrees that will eventually be combined to form 
a complete tree. 

Initially, the stack is empty and all EDUs are 
placed in the queue. Until a complete tree is found 
or no actions can be performed, the parser iteratively 
chooses to perform shift or reduce actions. The shift 
action creates a new subtree for the next EDU on the 
queue. 

Reduce actions create new subtrees from the
subtrees on the top of the stack. There are multiple types
of reduce actions. First, there are unary and binary
versions of reduce actions, depending on whether the
top 1 or 2 items on the stack will be included as
children in the subtree to be created. Second, there
are versions for each of the nonterminal labels (e.g.,
“satellite:attribution”).
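
The stack, queue, and the shift and reduce actions described above can be sketched as a short loop. This is a minimal illustration rather than the system's implementation: the action names, the tuple encoding of subtrees, the placeholder label "LABEL", and `toy_policy` (a stand-in for the trained classifier) are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple, Union

Tree = Union[str, Tuple]  # a leaf EDU string, or a tuple (label, child, ...)


def parse(edus: List[str], choose_action: Callable) -> List[Tree]:
    """Greedy shift-reduce loop; returns the final stack (ideally one tree)."""
    queue: Deque[str] = deque(edus)
    stack: List[Tree] = []
    while queue or len(stack) > 1:
        action, label = choose_action(stack, queue)
        if action == "shift" and queue:
            # Create a new subtree (here just the leaf) for the next EDU.
            stack.append(queue.popleft())
        elif action == "reduce_binary" and len(stack) >= 2:
            right = stack.pop()
            left = stack.pop()
            stack.append((label, left, right))
        elif action == "reduce_unary" and stack:
            stack.append((label, stack.pop()))
        else:
            break  # the chosen action cannot be performed
    return stack


def toy_policy(stack, queue):
    """Hypothetical stand-in for the classifier: reduce when possible, else shift."""
    if len(stack) >= 2:
        return "reduce_binary", "LABEL"
    return "shift", None


print(parse(["e1", "e2", "e3"], toy_policy))
```

With this toy policy, three EDUs are consumed by three shifts and two binary reduces, leaving a single left-branching tree on the stack.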

Following previous work, we collapse the full set
of RST relations to 18 labels. Additionally, we
binarize trees as described by Sagae and Lavie (2005).

Following Sagae (2009) and Ji and Eisenstein
(2014), we treat the problem of selecting the best
parsing action given the current parsing state (i.e.,
the stack and queue) as a classification problem.
We use multi-class logistic regression with an ℓ1
penalty, as implemented in the scikit-learn package,
to estimate our classifier.
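
For reference, one common way to write an ℓ1-penalized multi-class logistic regression objective is given below. The notation is an assumption introduced here: x_i is the feature vector for parsing state i, y_i is the gold action, A is the set of actions, w_a is the weight vector for action a, and λ is the regularization strength. The text does not say whether a multinomial or a one-vs-rest variant was used, so this should be read as a sketch of the general idea only. Note that scikit-learn parameterizes the penalty through C, the inverse of the regularization strength.

```latex
\hat{W} = \arg\min_{W} \;
  -\sum_{i=1}^{N} \log
    \frac{\exp\left(w_{y_i}^{\top} x_i\right)}
         {\sum_{a \in A} \exp\left(w_a^{\top} x_i\right)}
  \; + \; \lambda \sum_{a \in A} \lVert w_a \rVert_1
```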

The parser supports beam search and k-best
parsing, though we use simple greedy parsing (i.e., we
set the beam size and k to 1) for the experiments
described here.

4.3 Parsing Features 

To select the next shift or reduce action, the
parsing model considers a variety of lexical, syntactic,
and positional features adapted from various
previous work on RST discourse parsing, such as that of
Sagae (2009) and the systems we compare to in §5.
The features are as follows:


• the previous action (e.g., “binary reduce to 
satellite:attribution”) 

• the nonterminal symbols of the nth subtree on
the stack (n = 0, 1, 2), and their combinations

• the nonterminal symbols of the children of the
nth subtree on the stack (n = 0, 1, 2)

• the lowercased words (and POS tags) for the
tokens in the head EDU for the nth subtree on
the stack (n = 0, 1) and the first EDU on the
queue

• whether, for pairs of the top 3 stack subtrees
and the 1st queue item, the distance (in EDU
indices) between the head EDUs is greater than
n (n = 1, 2, 3, 4)

• whether, for pairs of the top 3 stack subtrees 
and the 1st queue item, the head EDUs are in 
the same sentence 

• for the head EDUs of the top 3 stack subtrees and
the 1st queue item, the syntactic head word
(lowercased), head POS, and the nonterminal
symbol of the highest node in the subtree

• syntactic dominance features between pairs of
the top 3 stack items and 1st queue item, similar
to those of Soricut and Marcu (2003)

• for each of the first 3 stack items or 1st queue 
item, whether that item starts a new paragraph 
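
As one example of how a feature from this list could be computed, the sketch below generates the thresholded distance indicators between head EDUs. The item names ("S0", "S1", "S2" for the top three stack subtrees and "Q0" for the first queue item), the feature-string format, and the function name are hypothetical; the actual feature encoding may differ.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence


def distance_features(head_edu_index: Dict[str, Optional[int]],
                      thresholds: Sequence[int] = (1, 2, 3, 4)) -> List[str]:
    """Indicator features: is the head-EDU distance of a pair greater than n?"""
    feats: List[str] = []
    for (a, ia), (b, ib) in combinations(head_edu_index.items(), 2):
        if ia is None or ib is None:
            continue  # the item is missing (e.g., the stack has fewer than 3 subtrees)
        for n in thresholds:
            if abs(ia - ib) > n:
                feats.append(f"dist({a},{b})>{n}")
    return feats


# Hypothetical state: index of the head EDU for each stack and queue item.
state = {"S0": 7, "S1": 4, "S2": 3, "Q0": 8}
feats = distance_features(state)
print(feats)
```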

5 Parsing Experiments 

Following Marcu (2000, pp. 143-144) and other
recent work, we evaluate our system according to the
F1 score over labeled and unlabeled spans of
discourse units in the RST treebank test set. This
evaluation is analogous to the evalb bracket scoring
program commonly used for constituency parsing
(http://nlp.cs.nyu.edu/evalb/). For
comparison with previous results, we use gold
standard discourse segmentations (but automatic
syntactic parses from ZPar).

We report F1 scores for agreement with the gold
standard on unlabeled EDU spans (“span”), spans
labeled only with nuclearity (“nuclearity”), and fully
labeled spans that include relation information
(“relation”).
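
The three levels of scoring can be sketched as set comparisons over spans with progressively more of the label included. This is only an illustration: the tuple layout, the function names, and the toy spans are assumptions, and the details of which spans are counted follow Marcu (2000) and are not reproduced here.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str, str]  # (first EDU, last EDU, nuclearity, relation)


def f1(gold: Set[tuple], pred: Set[tuple]) -> float:
    """F1 between two sets of (possibly labeled) spans."""
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def rst_scores(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Score at the three levels of labeling: span, nuclearity, relation."""
    return {
        "span": f1({s[:2] for s in gold}, {s[:2] for s in pred}),
        "nuclearity": f1({s[:3] for s in gold}, {s[:3] for s in pred}),
        "relation": f1(gold, pred),
    }


gold = {(0, 0, "satellite", "attribution"), (1, 1, "nucleus", "span"),
        (0, 1, "nucleus", "span")}
pred = {(0, 0, "satellite", "elaboration"), (1, 1, "nucleus", "span"),
        (0, 1, "satellite", "span")}
scores = rst_scores(gold, pred)
print(scores)
```

In this toy case all three predicted spans have the right boundaries, two also have the right nuclearity, and one is fully correct, so the scores fall from span to nuclearity to relation.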

We first tuned the ℓ1 regularization parameter using
grid search on the split of the training set used
for development evaluations, using a grid of powers
of 2 ranging from 1/16 to 16. We selected the
setting that led to the highest F1 score for fully labeled
spans (i.e., relation F1).


                            syntax                 span   nuclearity   relation
our system                  ZPar (retrained)       83.5   68.1         55.1
Li et al. (2014a)           Stanford               84.0   70.8         58.6
Joty et al. (2013)          Charniak (retrained)   82.5   68.4         55.7
Joty and Moschitti (2014)   Charniak (retrained)   -      -            57.3
Feng and Hirst (2014)       Stanford               85.7   71.0         58.2
Li et al. (2014b)           Penn Treebank          82.9   73.0         60.6
Ji and Eisenstein (2014)    MALT                   81.6   71.0         61.8
Human agreement             -                      88.7   77.7         65.8

Table 2: Test set discourse parsing performance in terms of F1 scores (%), using gold standard discourse segmentation.
“syntax” indicates the source of POS tags and syntactic parse trees: “Stanford” refers to the Stanford parser (Klein
and Manning, 2003b), “MALT” refers to Nivre and Marsi (2007), and “Charniak” refers to Charniak (2000).


              syntax             span   nuclearity   relation
our system    ZPar (retrained)   83.5   69.3         57.4
our system    PTB                84.7   71.2         59.4

Table 3: Development set discourse parsing performance in terms of F1 scores (%), using gold standard discourse
segmentation. “syntax” indicates the source of POS tags and syntactic parse trees.

We compare to recently reported results from Ji
and Eisenstein (2014) (their DPLP general +features
model), Feng and Hirst (2014), Li et al. (2014b),
Joty et al. (2013), Li et al. (2014a), and Joty
and Moschitti (2014).^2 The results are shown in
Table 2. The human agreement statistics were
originally reported by Ji and Eisenstein (2014). For each
system, the table indicates the source of POS tags
and syntactic parse trees (“Penn Treebank” means
that gold standard Penn Treebank trees and tags
were used).

We observe that our system is relatively close to
the others in terms of F1 scores. We hypothesize that
the differences in performance are at least partially
due to differences in syntactic parsing.


^2 Joty et al. (2013) and Joty and Moschitti (2014) do
not explicitly state the source of syntactic parsers, but we infer
from Joty et al. (2012) that the Charniak (2000) parser was used,
with a model trained on a subset of the Penn Treebank that did
not include the RST treebank test set.


5.1 The effect of automatic syntax parsing 

In order to show the effect of using automatic
parsing, we report performance on the development set
(§2), using either gold standard syntax trees from the
Penn Treebank or the automatic syntax trees from
our retrained ZPar model (§2) for computing
features. The F1 scores are shown in Table 3 (note that
we are reporting results using the optimal settings
from grid search on the development set).

It appears that the performance difference between
using automatic and gold standard syntax
is about 1 to 2 points of F1 score.
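
The size of this gap can be checked directly from the development set scores in Table 3. The short calculation below uses only the numbers reported there; the variable names are illustrative.

```python
# Development set F1 scores (%) from Table 3.
gold_syntax = {"span": 84.7, "nuclearity": 71.2, "relation": 59.4}
auto_syntax = {"span": 83.5, "nuclearity": 69.3, "relation": 57.4}

# Difference in F1 points, rounded to one decimal place as in the table.
diffs = {k: round(gold_syntax[k] - auto_syntax[k], 1) for k in gold_syntax}
print(diffs)
```

The differences are 1.2, 1.9, and 2.0 points for span, nuclearity, and relation, respectively.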

5.2 Parsing Speed 

In this section, we evaluate the speed of the parser.
Most previous papers on RST parsing do not
report runtime experiments, and most systems are not
widely available or easy to replicate.

Our parser uses a shift-reduce parsing algorithm
that has a worst-case runtime that is linear in the
number of EDUs. For comparison, Li et al. (2014b)
employ a quadratic time maximum spanning tree
parsing approach. The approach from Joty et al.
(2013) also uses a polynomial runtime algorithm.

Other linear time parsers have been developed
(Feng and Hirst, 2014; Ji and Eisenstein, 2014).
However, feature computation can also be a
performance bottleneck. Feng and Hirst (2014) report
an average parsing time of 10.71 seconds for RST
treebank test set documents (and 5.52 seconds for a
variant) on a system with “four duo-core 3.0 GHz
processors”, not including time for preprocessing
or discourse segmentation. In contrast, our system
takes less than half a second per test set document
on average (mean = 0.40, S.D. = 0.40, min. = 0.02,
max. = 1.85 seconds) on a 2013 MacBook Pro with
an i7-4850HQ CPU at 2.30 GHz. Of course, these
performance measurements are not completely
comparable since they were run on different hardware.
The preprocessing (ZPar) and segmentation (§3.1)
steps are similarly fast.

6 Conclusion 

In this paper, we have presented a fast
shift-reduce RST discourse segmenter and parser. The
parser achieves near state-of-the-art accuracy and
processes Penn Treebank documents in less than
a second, which is about an order of magnitude
faster than recent results reported by Feng and Hirst
(2014).